author: “Mhairi McNeill”
date: “4/06/2019”
Duration - 30 minutes
This week we’re going to learn all about working with data across a huge range of formats and storage options. In this very first lesson you’ll get a brief overview of some of the ways data can be stored. The flow chart below is a very basic summary of your options.
This is a rather superficial overview of data storage - you can spend a whole career learning how to make these decisions well. We’re only going to discuss the size and structure of your data and not cover more advanced topics like speed, availability of data and privacy.
When deciding where to store data, you first need to decide whether to store it locally (on your laptop) or remotely (either on a server or on some cloud platform). If your dataset is very big you will be forced to store it remotely. However, there are other reasons to store data remotely. You might want to share the data with others in your organisation and have a ‘single source of truth’ that you can all refer to.
Data which is stored locally can be much harder to control. If you need to guarantee the security and privacy of the data you work with - and due to GDPR and other regulations, you will need to do this for personal information - then remote storage is a better solution.
If you are storing data locally, your main options are between structured and unstructured ways of storing the data. We discussed the difference between structured and unstructured back in week one.
If you can store your data in a structured way, that is almost always better! The structure makes it much easier to query and manipulate your data.
Databases, including SQL databases and NoSQL databases, are a structured way of storing data. They are associated with remote data storage; however, it’s perfectly possible, and indeed sometimes sensible, to store your data in a local database that exists only on your computer. In fact, we’ll be creating one this week.
Delimited files are a broad class of methods for storing ‘rectangular data’, i.e. data that comes in rows and columns. Delimited files are plain text files which mark rows and columns using some kind of delimiter. Comma-separated values files (CSVs) are a very common type of delimited file where columns are separated with commas and rows with new lines.
Other commonly used delimiters are tabs, semicolons (;), colons (:) or pipes (|). These alternatives, semicolons in particular, are more widely used in countries where the comma is used as a decimal separator, because a comma inside a number would otherwise be mistaken for a column break.
An example of a tab delimited file.
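As a rough sketch of how a program reads a delimited file, the snippet below parses a few lines of tab-delimited text. Python is used purely for illustration, and the column names and values are made up for this example.

```python
import csv
import io

# Hypothetical tab-delimited data: one header row, then one row per record.
tsv_text = "name\tcity\tage\nAisha\tGlasgow\t34\nBen\tEdinburgh\t28\n"

# csv.DictReader splits each line on the chosen delimiter and
# uses the first row as the column names.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

for row in rows:
    print(row["name"], row["city"], row["age"])
```

Changing the `delimiter` argument is all it takes to read a semicolon- or pipe-delimited file in the same way. Note that every value is read as text, so numbers need converting before you do arithmetic with them.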
JSON files are also just plain text files with special characters that provide structure. However, they are better suited to more complex data which isn’t rectangular, such as nested data.
JSON is very commonly used for data from the internet. Your web browser will receive JSON files from websites.
An example of JSON data
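To give a feel for why JSON suits non-rectangular data, here is a small sketch that reads some nested JSON text. Python is used purely for illustration, and the people and pets are invented for this example.

```python
import json

# Hypothetical JSON text: each person has a list of pets, and the lists
# have different lengths, so the data does not fit neatly into one table.
json_text = """
{
  "people": [
    {"name": "Aisha", "pets": [{"type": "cat", "name": "Mog"}]},
    {"name": "Ben", "pets": []},
    {"name": "Chen", "pets": [{"type": "dog", "name": "Rex"},
                              {"type": "fish", "name": "Bubbles"}]}
  ]
}
"""

data = json.loads(json_text)

# Nested values are reached by key (for objects) and by position (for arrays).
pet_counts = {person["name"]: len(person["pets"]) for person in data["people"]}
print(pet_counts)
```

Storing the same information in a delimited file would mean either repeating each person on several rows or squeezing several pets into one cell.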
Throughout the history of computers, many data storing standards have been developed. There are countless other ways of storing data, too many to mention here. The big advantage of text files like delimited files and JSON is that they are widely used and many different programs can open them, which makes them a good way of storing and sharing data.
Traditionally all database data was stored in a SQL database. SQL stands for Structured Query Language, a special language for reading and changing data. You’ll be learning some SQL later this week.
SQL databases work with ‘tables’, so like delimited files they prefer their data to come in rectangles made of rows and columns. However, SQL databases tend to have links between tables, which lets them store data with a more complicated structure.
An SQL database
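The idea of linked tables can be sketched with SQLite, a small SQL database that runs locally. This is only an illustration in Python, not the setup used later in the course, and the table and column names are hypothetical.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in this process;
# passing a file name instead would create a local database file.
conn = sqlite3.connect(":memory:")

# Two hypothetical tables, linked through the owner_id column.
conn.executescript("""
CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE pets (
    id INTEGER PRIMARY KEY,
    name TEXT,
    owner_id INTEGER REFERENCES owners(id)
);
INSERT INTO owners (id, name) VALUES (1, 'Aisha'), (2, 'Chen');
INSERT INTO pets (name, owner_id) VALUES ('Mog', 1), ('Rex', 2), ('Bubbles', 2);
""")

# A JOIN follows the link between the two tables.
query = """
SELECT owners.name, pets.name
FROM pets
JOIN owners ON pets.owner_id = owners.id
ORDER BY pets.id
"""
linked_rows = conn.execute(query).fetchall()
print(linked_rows)
conn.close()
```

Each table is still a rectangle, but the link lets one owner have any number of pets without repeating the owner's details on every row.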
The alternative to SQL databases is noSQL databases. Originally, noSQL stood for ‘not SQL’, but now it’s often read as ‘not only SQL’, as many noSQL databases can be accessed in a SQL-like way.
noSQL databases generally do not have a table structure. Some of them store data in a similar way to JSON, others act like a labelled document store.
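The ‘labelled document store’ idea can be sketched very roughly with a plain Python dictionary. This is a toy stand-in, not a real noSQL database, and the `put` and `get` helpers and the article data are hypothetical names made up for this example.

```python
import json

# A plain dictionary standing in for a document store (not a real database):
# each label (key) maps to one JSON document.
store = {}

def put(label, document):
    """Save a document under a label, stored as JSON text."""
    store[label] = json.dumps(document)

def get(label):
    """Load the document saved under a label."""
    return json.loads(store[label])

# Documents do not have to share the same fields, unlike rows in a table.
put("article-1", {"title": "Storing data", "tags": ["data", "sql"]})
put("article-2", {"title": "Cloud basics", "author": {"name": "Ben"}, "words": 850})

print(get("article-2")["author"]["name"])
```

The point to notice is that the two documents have different fields, which a table with fixed columns would not allow without leaving gaps.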
If you have very, very large data then you may need to use some type of distributed storage system, where your data is stored across many servers. You would then access the data using a framework like Hadoop, which processes it with the MapReduce programming model. This is only necessary when your data is very large, on the scale of terabytes rather than gigabytes.
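The map-and-reduce idea behind such systems can be sketched on a single machine. This toy Python example only imitates the pattern: the chunks of text are invented, and a real system would spread both the data and the work across many servers.

```python
from collections import Counter
from functools import reduce

# Hypothetical chunks of text; in a real distributed system each chunk
# would be stored on a different server.
chunks = [
    "data is stored on many servers",
    "many servers store data",
    "data data data",
]

def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(chunk.split())

def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

partial_counts = [map_chunk(chunk) for chunk in chunks]
total_counts = reduce(merge_counts, partial_counts, Counter())
print(total_counts["data"], total_counts["servers"])
```

Because each chunk is counted on its own, the map step could run on whichever server holds that chunk, and only the small partial counts need to be sent elsewhere to be combined.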
Task - 15 mins
In small groups discuss the best way of storing the following pieces of data. For many of these there is no one right answer.
A simple dataset showing the attendance at your union meeting for the past 3 months.
The entirety of Wikipedia (about 60 gigabytes of data).
All the papers you needed for writing your PhD Thesis.
Your HR data for a company with 1000 employees.
Every news article for a large newspaper.
A dataset on worldwide life expectancy over the last 50 years.
Answer
- Local - probably a delimited file, or an Excel spreadsheet.
- Would probably need to be stored remotely if you want to use it. A noSQL database would probably be more suitable.
- Local (possibly with remote backup!), unstructured.
- Local, probably a SQL database, or a remote commercial cloud solution.
- Remote, probably noSQL. Here’s an interesting article about the Guardian switching from using a noSQL database to a SQL database: https://www.theguardian.com/info/2018/nov/30/bye-bye-mongo-hello-postgres
- Local, probably a delimited file.
As well as storing data remotely, you might also work remotely. That is, your applications will run on a separate server that you can access from your own computer. This is often known as cloud computing. RStudio, the program you are using to write R code, has a cloud version.
This can be particularly advantageous for big companies; they can significantly reduce their IT infrastructure costs because they no longer need as many local technicians. Working in the cloud means everyone in the company will have consistent versions and updating to the latest version is much easier.
However, there are disadvantages too. For example, you will need an internet connection to do your work. Cloud working can have greater privacy and security issues. If your cloud server is hosted by a third party, your company will have less control over how you work and the programs you can use.
Why would you use remote storage over local storage?
Answer
Your data is too large to store locally, or you want other people to have access to it, or to be able to combine it with other remote data easily.
Why would you use unstructured data over structured?
Answer
If you have no option!
Why would you choose noSQL over SQL?
Answer
Your data doesn’t fit neatly into a table structure. It might be possible to redesign your data structure to achieve this. We will be covering designing data structures in week 10.
Why would you choose JSON over delimited?
Answer
Again, because your data doesn’t neatly fit into a table, or you would like to have flexible field sizes.
Why would you use distributed storage over a database?
Answer
Your data is too large to fit into a database.